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Abstract 

In the problem of model selection for a given family of linear estima¬ 
tors, ordered by their variance, we offer a new “smallest accepted” ap¬ 
proach motivated by Lepski’s method and multiple testing theory. The 
procedure selects the smallest model which satisfies an acceptance rule 
based on comparison with all larger models. The method is completely 
data-driven and does not use any prior information about the variance 
structure of the noise: its parameters are adjusted to the underlying pos¬ 
sibly heterogeneous noise by the so-called “propagation condition” using 
a wild bootstrap method. The validity of the bootstrap calibration is 
proved for finite samples with an explicit error bound. We provide a com¬ 
prehensive theoretical study of the method and describe in detail the set 
of possible values of the selector m. We also establish some precise ora¬ 
cle error bounds for the corresponding estimator 9 ~ 6fj^ which equally 
applies to estimation of the whole parameter vectors, some subvector or 
linear mapping, as well as the estimation of a linear functional. 
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1 Introduction 

Model selection is one of the key topics in mathematical statistics. A choice between 
models of differing complexity can often be viewed as a trade-off between overfitting the 
data by choosing a model which has too many degrees of freedom and smoothing out 
the underlying structure in the data by choosing a model which has too few degrees of 
freedom. This trade-off which shows up in most methods as the classical bias-variance 
trade-off is at the heart of every model selection method (as for example in unbiased risk 
estimation, Kneip (1994) or in penalized model selection, Barron et al. (1999), Massart 
(2007)). This is also the case in Lepski’s method, Lepski (1990), Lepski (1991), Lepski 
(1992), Lepski and Spokoiny (1997), Lepski et al. (1997), Birge (2001) and risk hull min¬ 
imization, Cavalier and Golubev (2006). Many of these methods allow their strongest 
theoretical results only for highly idealized situations (for example sequence space mod¬ 
els), are very specific to the type of problem under consideration (for instance, signal or 
functional estimation), require to know the noise behavior (like homogeneity) and the 
exact noise level. Moreover, they typically involve an unwieldy number of calibration 
constants whose choice is crucial to the applicability of the method and is not addressed 
by the theoretical considerations. For instance, any Lepski-type method requires to fix a 
numerical constant in the definition of the threshold, the theoretical results only apply 
if this constant is sufficiently large while the numerical results benefit from the choice 
of a rather small constant. Spokoiny and Vial (2009) offered a propagation approach to 
calibration of Lepski’s method in the case of the estimation of a one-dimensional quantity 
of interest. However, the proposal still requires the exact knowledge of the noise level 
and only applies to linear functional estimation. A similar approach has been applied to 
local constant density estimation with sup-norm risk in Gach et al. (2013) and to local 
quantile estimation in Spokoiny et al. (2013). 

In the case of unknown but homogeneous noise, generalized cross validation can be 
used instead of unbiased risk estimation method. For the penalized model selection, 
recently a number of proposals appeared to apply one or another resampling method. 
Arlot (2009) suggested the use of resampling methods for the choice of an optimal penal¬ 
ization, following the framework of penalized model selection, Barron et al. (1999), Birge 
and Massart (2007). The validity of a bootstrapping procedure for Lepski’s method has 
also been studied in Chernozhukov et al. (2014) with new innovative technical tools with 
applications to honest adaptive confidence bands. 

An alternative approach to adaptive estimation is based on aggregation of different 
estimates; see Goldenshluger (2009) and Dalalyan and Salmon (2012) for an overview 
of the existing results. However, the proposed aggregation procedures either require 
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two independent copies of the data or involves a data splitting for estimating the noise 
variance. Each of these requirements is very restrictive for practical applications. 

Another point to mention is that the majority of the obtained results on adaptive 
estimation focus on the quality of estimating the unknown response, that is, the loss 
is measured by the difference between the true response and its estimate. At the same 
time, inference questions like confidence estimation would require to know some addi¬ 
tional information about the right model parameter. Only few results address the issue 
of estimating the true (oracle) model. Moreover, there are some negative results show¬ 
ing that a construction of adaptive honest confidence sets is impossible without special 
conditions like self-similarity; see, e.g. Gine and Nickl (2010). 

This paper aims at developing a unified approach to the problem of ordered model 
selection with the focus on the quality of model selection rather than on accuracy of 
adaptive estimation under realistic assumptions on the model. Our setup covers linear 
regression and linear inverse problems, and equally applies to estimation of the whole 
parameter vectors, a subvector or linear mapping, as well as estimation of a linear func¬ 
tional. The proposed procedure and the theoretical study are also unified and do not 
distinguish between models and problems. In the case of a linear inverse problem, it is 
applicable to mild and severely ill-posed problems without prior knowledge of the type 
and degree of ill-posedness; cf. Tsybakov (2000), Cavalier et al. (2002). Another impor¬ 
tant issue is that the procedure does not use any prior information about the variance 
structure of the noise under assumption of minimal Holder smoothness 1/4 on the un¬ 
derlying function. The method automatically adjusts the parameters to the underlying 
possibly heterogeneous noise: the resampling technique allows to achieve the same qual¬ 
ity of estimation as if the noise structure were precisely known. Also we allow for a 
model misspecification: the linear structure of the response can be violated, in this case 
the procedure adaptively recovers the best linear projection. 

Consider a linear model Y = 0* -|- e in iR" for an unknown parameter vector 

6 * G ]RP and a given pxn design matrix ^ . Suppose that a family of linear smoothers 

— SmY is given, where Sm is for each m G M a given pxn matrix. We also assume 
that this family is ordered by the complexity of the method. The task is to develop a 
data-based model selector fh which performs nearly as good as the optimal choice, which 
depends on the model and is not available. The proposed procedure called the “smallest 
accepted” (SmA) rule can be viewed as a calibrated Lepski-type method. The idea how 
the parameters of the method can be tuned, originates from Spokoiny and Vial (2009) 
and is related to a multiple testing problem. The whole procedure is based on family of 
pairwise tests, each model is tested against all larger ones. Finally the smallest accepted 
model is selected. The critical values for this multiple testing procedure are fixed using 
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the so-called propagation condition. Theorem 2.1 presents finite sample results on the 
behavior of the proposed selector m and the corresponding estimator 6 = Of^ . In 
particular, it describes a concentration set M° for the selected index fh and states an 
oracle bound for the resulting estimator 6 = Of^ ■ Usual rate results can be easily derived 
from these statements. Further results address the important issue called “the payment 
for adaptation” which can be defined as the gap between oracle and adaptive bounds. 
Theorem 2.2 gives a general description of this quantity. Then we specify the results to 
important special cases like projection estimation and estimation of a linear functional. It 
appears, that in some cases the obtained results yield sharp asymptotic bounds. In some 
other cases they lead to the usual log-price for adaptation. However, all these results 
require a known noise distribution. Section 3 explains how the proposed procedure can 
be tuned in the case of unknown noise using a bootstrap procedure. We establish explicit 
error bounds on the accuracy of the bootstrap approximation and show that the procedure 
with bootstrap tuning does essentially the same job as the ideal procedure designed for 
the known noise. The study is quite involved because the procedure uses the same data 
twice for parameter tuning and for model selection. 

The paper is structured as follows. The next section presents the procedure and the 
results for an idealistic situation when the noise distribution is precisely known. We in¬ 
troduce the SmA selector fh and explain how it can be calibrated. Then we describe the 
set of possible m-values and establish probabilistic oracle bounds. Section 2.6 explains 
how the method and the results can be extended to the case of a polynomial loss func¬ 
tion. The results are also specified to the particular problems of projection and linear 
functional estimation. Section 3 extends the method and the study to the realistic case 
with unknown heteroscedastic noise by using a resampling technique. The proofs and a 
detailed study of the bootstrap procedure in the linear Gaussian setup are given in the 
appendix. We also collect there some useful technical statements for Gaussian quadratic 
forms and sums of random matrices. 


2 Model and problem. Known noise variance 

This section presents the model selector for the idealistic case when the noise distribution 
is precisely known. In the next section we explain how the unknown noise structure can 
be recovered from the data using a resampling technique. First we specify our setup. We 
consider the following linear Gaussian model: 


Yi = e* 


+ 


Ei ~ h[(0,o-^) i.i.d. , 


i = 1,... ,n. 


( 2 . 1 ) 
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with given design ,..., in JR^ . We also write this equation in vector form Y = 
6* -\-e G -ZR" , where ^ is pxn design matrix and e ~ ^(0, cr^/n) • Below we assume a 
deterministic design, otherwise one can understand the results conditioned on the design 
realization. 

In what follows, we allow the model (2.1) to be completely misspecified. We mainly 
assume that the observations Yi are independent and define the response vector f* = 
lEY with entries fi. Such a model can be written as 

Yi = fi + £i ■ (2-2) 

Our study allows that the linear parametric assumption f* = 6* is violated, and 

the underlying noise e = (ej) can be heterogeneous and non-Gaussian. However, in 
this section we assume the noise distribution to be known. The main oracle results of 
Theorem 2.1 below do not require any further conditions on the noise. Some upper 
bounds on the quantities 3 ^. entering in the oracle bounds are established under i.i.d. 
Gaussian noise, but can be easily extended to non-Gaussian heterogeneous noise under 
moment conditions. For the linear model (2.2), define 0* G IRF as the vector providing 
the best linear fit: 


e* argminlFlIl^ - V/T 

e 

As usual, a pseudo-inversion is assumed if the matrix is degenerated. 

Below we assume a family {0m} of linear estimators Om = SmY of 6* to be 
given. Typical examples include projection estimation on an m-dimensional subspace or 
regularized estimation with a regularization parameter am, penalized estimators with a 
quadratic penalty function, etc. To include specific problems like subvector/functional 
estimation, we also introduce a weighting q x p-matrix W for some fixed q > 1 and 
define quadratic loss and risk with this weighting matrix W: 

Qm =' \\W{em - e*)f, Olm iE||VF(0m " 0*)f • 

Of course, the loss and the risk depend on the choice of W . We do not indicate this 
dependence explicitly but it is important to keep in mind the role of W in the definition 
of Qm • Typical examples of W are as follows. 

Estimation of the whole vector 6* Let W be the identity matrix W = Ip with 
q = p ■ This means that the estimation loss is measured by the usual squared Euclidean 
norm ||0m — 0*|P ■ 
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Prediction Let W be the square root of the total Fisher information matrix F = 
cr- 2 i 27 i^T ^ _ Y , Such a type of loss is usually referred to as prediction loss 

because it measures the fit and the prediction ability of the true model by the model 
with the parameter 0 . 

Semiparametric estimation Let the target of estimation not be the whole vector 
0* but some subvector 0 q of dimension q. The estimate HOm is called the profile 
maximum likelihood estimate. The matrix W can be defined as the projector Uq on 
the 9q subspace. The corresponding loss is equal to the squared Euclidean norm in this 
subspace: 

Qm = \\no{em-e*)f. 

Alternatively, one can select W‘^ as the efficient Fisher information matrix defined by 
the relation 


¥ = {no¥-^n^) \ 

Linear functional estimation The choice of the weighting matrix W can be adjusted 
to address the problem of estimating some functionals of the whole parameter 0* . 

In all cases, the most important feature of the estimators 9^. is linearity. It greatly 
simplifies the study of their properties including the prominent bias-variance decompo¬ 
sition of the risk of 9m ■ Namely, for the model (2.2) with JEe = 0, it holds 

lE9m = 9*m= Smf\ 

= \\W{9*m-9*)f+ti{WSm^^^{e)SlW^] 

= \\W{Sm-S)rf+ti{WSmyav{s)SlW^}. (2.3) 

The optimal choice of the parameter m can be defined by risk minimization: 

m* argminfkm. 

mSM 

The model selection problem can be described as the choice of m by data which mimics 
the Oracle, that is, we aim at constructing a selector m leading to the adaptive estimate 
9 = 9m with properties similar to the oracle estimate 9m* ■ 

Below we discuss the ordered case. The parameter m G M is treated as complexity 
of the method 9m ■ In some cases the set M of possible m choices can be countable 
and/or continuous and even unbounded. For simplicity of presentation, we assume that 
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M is a finite set of positive numbers, |M| stands for its cardinality. Typical examples 
are given by the number of terms in the Fourier expansion, or by the bandwidth in the 
kernel smoothing. In general, complexity can be naturally expressed via the variance 
of the stochastic term of the estimator '■ the larger m , the larger is the variance 
\ax{W6m) ■ In the case of projection estimation with m-dimensional projectors Sm, 
this variance is linear in m , Var(0m) = . In general, dependence of the variance 

term on m may be more complicated but the monotonicity constraint (2.4) has to be 
preserved. 

Further, it is implicitly assumed that the bias term ||IF(0* — becomes small 

when m increases. The smallest index m = mo corresponds to the simplest (zero) model 
with probably a large bias, while m large ensures a good approximation quality 0*^ ~ 9* 
and a small bias at cost of a big complexity measured by the variance term. In the case 
of projection estimation, the bias term in (2.3) describes the accuracy of approximating 
the response f* by an m-dimensional linear subspace and this approximation improves 
as m grows. However, in general, in constrast to the case of projection estimation, one 
cannot require that the bias term ||VF(0* — 9*^)\\ monotonously decreases with m . One 
example is given by an estimation-at-a-point problem. 

Example 2.1. Suppose that a signal 6* is observed with noise: Yi = 0* + £j . Consider 

~ def 

the set of projection estimates 6m on the first m coordinates and the target is 0* = 
WO = 6j . If 9* is composed of alternating blocks of 1 ’s and — 1 ’s with equal length, 
then the bias \4>* — (pml for (pm = Ylj<m monotonous in m. 

2.1 Smallest accepted (SmA) method in ordered model selection 

First we recall our setup. Due to the linear structure of the estimators 9m = <SmY and 
of the loss function W, one can consider = JCmY with JCm = WSm- —?■ , 

m G M, as a family of linear estimators of the g-dimensional target of estimation 
cf)* = W9* = WSf = %f for % = WS. 

Now we discuss a general approach to model selection problems based on multiple 
testing. Suppose that the given family {(pm , w- G M} of estimators is naturally ordered 
by their complexity (variance). Due to (2.3), this condition can be written as 


%m Var(£) < "Xm' Var(e) A, 


■T 
m' ’ 


m! > m. 


(2.4) 


One would like to pick up a smallest possible index m G M which still provides a 
reasonable fit. The latter means that the bias component 


\\bm 



* 

'm 


(t)*f = \\{%m-%)rf 


2 
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in the risk decomposition (2.3) is not significantly larger than the variance 

tr{Var(^^)} = tr{3Cm Var(£) 3C^}. 

If m° G M is such a “good” choice, then our ordering assumption yields that a further 
increase of the index m over m° only increases the complexity (variance) of the method 
without real gain in the quality of approximation. This latter fact can be interpreted 
in term of pairwise comparison: whatever m G M with m > m° we take, there is no 
significant bias reduction in using a larger model m instead of m° . This leads to a 
multiple testing procedure: for each pair m > m° from M, we consider a hypothesis of 
no significant bias between the models m° and m , and let Tm,m° be the corresponding 
test. The model m° is accepted if Tm,m° = 0 for all m > m° . Finally, the selected 
model is the “smallest accepted”: 

fh argmin|m° G M: Tm,m° = 0, Vm > m°|. 

Usually the test Tm,m° can be written in the form 

for some test statistics Tm,m° and for critical values im,m° ■ The information-based 
criteria like AIC or BIG use the likelihood ratio test statistics Tm,m° = [Om — 

A great advantage of such tests is that the test statistic Tm,m° is pivotal 
(with m — m° degrees of freedom) under the correct null hypothesis, this makes 
it simple to compute the corresponding critical values. Below we apply another choice 
corresponding to Lepski-type procedure and based on the norm of differences — (p^o : 

~ ~ ^pf 

Tm,m° = ll<^m “ II = — ^m° ■ 

The main issue for such a method is a proper choice of the critical values 3m,m° • One 
can say that the procedure is specified by a way of selecting these critical values. Below 
we offer a novel way of carrying out this choice in a general situation by using a so- 
called propagation condition-, if a model m° is “good” it has to be accepted with a high 
probability. This rule can be seen as an analog of the family-wise level condition in a 
multiple testing problem. Rejecting a “good” model is the family-wise error of first kind, 
and this error has to be controlled. 
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2.2 Oracle choice 

To specify precisely the meaning of a good model, we use below for each pair m > m° 
from M the decomposition 

~ +^)|| = II II i (2-5) 

where with 'Xm,m° = 


— ^171,171° J 


r»ra^m° 


def ^ 
— A,7i 


oOS. 


We also define 




— '^mj 1 


- .A^rrife. 


It obviously holds = 0. Introduce the q x q -matrix Vm,m° as the variance of 

4*m° • 


— ^ar((^^ cf)^a^ — Var(3Cm,m°^) — Var(£) 


If the noise e is homogeneous with Var(e) = a'^In , it holds 


^m,m° — ^ ^m,m°^rnm°' 


Further, 


^"^ 171 , 771 ° — ||^m,m°|| T '®'ll^m,m°ll — ||^m,m°|| T Pm,m°; (2-6) 

Pm,m° tr(Vm,m°) = . 

The bias term 6m,m° is significant if its squared norm is competitive with 

the variance term Pm,m° = ti'(Vm,m°) • We say that m° is a “good” choice if there is no 
significant bias 6m,m° for any m > m° . This condition can be quantihed in the following 
“bias-variance trade-off”: 


||6m,m°|| ^ /3 Pm,m° i TTl ^ TTl 

for a given parameter (3 which controls the bias component in the risk due to decompo¬ 
sition (2.6). Now define the oracle m* as the minimal m° with the property (2.7): 

m* min|m°: max {||6m,m°||^ -/3^Pm,m°} < o|. (2.8) 

L 77l>7n° ) 


SPOKOINY, V.AND WILLRICH, N. 


11 


2.3 Tail function, multiplicity correction, critical values ^rn,m° 

Now we explain a possible choice of critical values im,m° in the situation when the 
noise distribution is known. A particular example is the case of Gaussian errors e ~ 
N(0,cr^/,i). Then the distribution of the stochastic component ^rnm° is known as well. 
In the Gaussian case, it is N(0,Vm,m°) with the covariance matrix Ym,m° ■ Introduce 
for each pair m > m° from M a tail function Zm,m° (t) of the argument t such that 

^ = G . (2-9) 

Here we assume that the distribution of ||^mm°ll i® continuous and the value 

is well defined. Otherwise one has to define Zm,m° {i) as the smallest value for which the 

error probability is smaller than e“*. 

For checking the propagation condition, we need a uniform in m > m° version of the 
probability bound (2.9). Let 

|m G M: m > m°}. 

Given x, by qm° = qm° (x) denote the corresponding multiplicity correction: 

^ (x + )} ^ =6 . (2-10) 

A simple way of computing the multiplicity correction qm° is based on the Bonferroni 
bound: qm° = log(#M'’'(m°)). However, it is well known that the Bonferroni bound is 
very conservative and leads to a large correction q^o , especially if the random vectors 
are strongly correlated. This is exactly the case under consideration. Note that the 
joint distribution of the ’s is precisely known. This allows to define the correction 

qm° = Q'm°(x) just by condition (2.10). Finally we define the critical values ^rn,m° by 
one more correction for the bias: 

im,m° — (x + qm°) T (2-11) 

for Pm,m° = ti'(Vm,m°) • This definition still involves two numerical tuning constants x 
and /3. The first value x controls the nominal rejection probability under the null, a 
usual choice x = 3 does a good job in most of cases. The value /3 controls the amount 
of admissible bias in the definition of a good choice; cf. (2.7) and (2.8). This value is 
mainly for theoretical study, in practice one can always take /3 = 0. 
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2.4 SmA choice and the oracle inequality 

Define the selector in by the “smallest accepted” (SmA) rule. Namely, with ^rn,m° from 
(2.11), the acceptance rule reads as follows: 


|m° is accepted) ) max |Tmm° 



The SmA rule is 


in = “smallest accepted” 

= min-^m°: max - 3 m,m°} < 0). (2.12) 

Our study mainly focuses on the behavior of the selector m. The performance of the 
resulting estimator cf) = is a kind of corollary from statements about the selected 
model in . The ideal solution would be m = m* , then the adaptive estimator cf) coincides 
with the oracle estimate (f)^* . 

The bound (2.9) automatically ensures the desired propagation property: any good 
model m° in the sense (2.7) will be accepted with probability at least 1 — e”’^. In some 
sense, this property is built-in by the construction of the procedure. By definition, the 
oracle m* is also a “good” choice, this yields 

lP{m* is rejected) < e“^. (2.13) 

Therefore, the selector in typically takes its value in M“(m*), where 

M“(m*) = {m G M: m < m*} 

is the set of all models in M smaller than m* . It remains to check the performance 
of the method in this region. The next step is to specify a subset M° of M“(m*) of 
highly probable m-values. We will refer to this subset as the zone of insensitivity. The 
definition of m* implies that there is a significant bias for each m G M“(m*). If this 
bias is really large, then, again, the probability of selecting m can be bounded from 
above by a small value. Therefore, the zone of insensitivity is composed of m-values for 
which the bias is significant but not very large. 

Now we present a formal description which specifies a subset of M“(m*) for 
which the bias ||6m*,m|| is sufficiently large and hence, the probability of the event {m G 
is negligible. 

Theorem 2.1. Let Zm,m°{') be the tail function from (2.9) for each pair m > m° G M . 
Given x and (3, let im,m° be given by (2.10) and (2.11). Then the propagation property 
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(2.13) is fulfilled for the SmA selector m . Moreover, for any subset C M (m*) s.t. 

> im* ,m “1“ ^m*,m(d^s)i ^ ^ j 

def 

for Xs = x + log(|M'^|) with |M'^| being the cardinality of "MF, it holds 

(2.15) 

(2.16) 

2 e“^ 

(2.17) 

Remark 2.1. Note that the choice Xs = x + log(|M'^|) relies on crude Bonferroni argu¬ 
ments and the definition of can be refined by choosing Xg more carefully. However, 
this value only enters in the theoretical bound and is not used in the procedure, a fine 
tuning for this value is not required. Obviously Xg < x -|- log(|M“(m*)|). 

Remark 2.2. The result (2.17) is called the oracle bound because it compares the loss of 
the data-driven selector fh and of the optimal choice m* . The value in (2.16) can 
be viewed as a “payment for adaptation”. An interesting feature of the presented result 
is that not only the oracle quality but also the payment for adaptation depend upon 
the unknown response f* and the corresponding oracle choice m* . In the worst case 
of a model with a flat risk profile tRrn, the set M° can coincide with the whole range 
M“(m*). Even in this case the bounds (2.15) and (2.17) are meaningful. However, 
the payment for adaptation 3 ^* in this case can be larger than the oracle risk. In the 
contrary, if the risk function fkm grows rapidly as m decreases below m* , then the set 
M° is small and the value 3 ,^* is much smaller than the oracle risk ■ 

2.5 Analysis of the payment for adaptation 3 ,^* 

Here we present an upper bound on 3 ^, for a special case of Gaussian independent errors 
Ej. The beneht of considering the Gaussian case is that each vector ^ is Gaussian 
as well, which simplifies the analysis of the tail function Zm',m{') ■ However, the results 


lP{m e M'^) < 


The SmA estimator satisfies the following bound: 




def 


where 3 ^. is defined with M° = M (m*) \ as 


— def 

3m* ~ max 3m*,m ' 
mEM° 


This implies the probabilistic oracle bound: with probability at least 1 — 
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can be extended to non-Gaussian errors e* under exponential moment conditions. Below 
mo denotes the smallest model in M . Writing Ym <7^ Xm define 


def 
Pm = 

. def 
'^m — 


tr(Vm) 

llVmllop. 


Theorem 2.2. Assume the conditions of Theorem 2.1. Let also Pm,m° = tr(Vm,m°) 
and \m,m° — ||V}7ijm°||op with Y m,m° — Satisfy Pm*,m ^ Pm*,mo — Pm* and 

Am*,m < ^m*,mo < -^m* for all mo < m < m* . If the errors e* are normal zero mean 
then the critical values im,m° given by (2.11) satisfy 

(1 /3)\/Pm,m° m,m° {x-plog(|M|)} , 

while the payment for adaptation 3 ^* follows the bound 

3m* ^ (1 + l^)y/Pm* ,mQ m* ,mo {x-Plog(|M (m*)|)} 

< (1 -|- /3)-y/Pm* + \/2Am*{x -|- log(|M|)} . 


Some special cases of this result for projection and linear functional estimation will 
be discussed in Sections 2.7 and 2.8 below. 


2.6 Power loss function 

The probabilistic oracle bound of Theorem 2.1 provides some statement about typical 
behavior of the adaptive SmA estimate (f) = . Unfortunately, this bound does not 

yield a risk bound for quadratic or polynomial losses: even if big losses occur with a small 
probability, the related risk can still be large. It happens that the SmA procedure can 
be easily tuned to secure an oracle risk bound. 

For simplicity of notation, we only consider the quadratic risk 

We aim at comparing the risk of the SmA procedure with the risk Xm* of the oracle 
estimate 0^* . Recall the representation 

Xm ]E\\^m - 0II' = ^ll^mf + ll&mf = Pm + \\bmf 

with Pm = tr(Vm) and Ym = Var(^m) • For our analysis, we have to slightly modify the 
definition of the oracle (2.8). Namely, to ensure an oracle risk bound, we require that 
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not only the model m* is “good” but also all the larger models m > m* are “good” as 
well: 


m* =minim°: max {||6m',m|P -< 0 (2-18) 

m',mS3VC+(m°): m'>m J 

Below we also suppose that the bias component ||6m|P fulfills 

llfemll < Il&m*||, m>m*. (2.19) 

Otherwise, one can define ||6m*|| {m*) ll^mll • 

The choice of the critical values 3m',m for the SmA procedure has to be slightly 
changed to ensure a risk bound for quadratic loss. For this, we need a bit more detailed 
analysis of the SmA procedure in the propagation zone m > m* . In this zone the 
variance dominates the bias, therefore, the SmA procedure can be tuned in the situation 
when there is no signal and hence no bias at all: 

I'm',™ ||*/*m' ^m|| • 

The analysis is based on a simple but important observation that if fh = m > m* , then 
the good model m° = rny_i\ is rejected, where denotes the next smaller model 

with respect to m. The latter means that at least one check based on Tm',m(_i) fails. 
The same can be expressed as follows: the maximum of the r.v.’s Tm'.mf.i) > 

3m',m(_i)) is positive. For a formal description, introduce for each m and x a random 
event 


Am(x) ll( inax {||C',mll > O) 

Vm'eM+(m) / 


on which at least one of the test statistics Tm',m = ||C 


m' ,m I 


exceeds the critical value 




m' ,m 


(x). The case of probabilistic loss focuses on the probability of this event, the value 
X is selected to make it small enough. Now, under the polynomial loss function, we need 
a bound for the moment of the corresponding loss. Namely, for each m , consider the 
expectation of Pm^H^mlP on the random set Am(_i)(x) : 


3?+(x) IE V1)1I 


max 111^,. 


F-i)' 




h-1)' 


c)} >0 


Similarly one can consider any other power loss function by replacing (pm^'^^ 11 Cm 11)^ with 
(Pm IlCmll)'^ • particular, q = 0 yields the probability loss considered before. 

Now we define the value Xm(_i) in such a way that the related deviation risk 1R+(x) 
can be controlled from above. Let am be a given decreasing sequence. Its choice will be 
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discussed below. We fix for each m the value such that 




h-i)^ 

It implies 

^ — ClmPm ) 

(x)^ ^ Otm ■ 

Now define the critical values irn,m° of the SmA procedure as 

hm,m° — ^m,m°(Xm°) “1“ PPmm°' 


( 2 . 20 ) 


( 2 . 21 ) 


( 2 . 22 ) 


The resulting procedure reads exactly as in the case of probabilistic loss: 

m = mini m°: max - lm,m° } < 0 L (2.23) 

It is worth mentioning that the procedure is the same, and even the critical values im,m° 
are given by the same formula, as in the case of probabilistic loss. The only difference is 
in the propagation condition (2.20) which is a bit stronger than a similar condition for 
indicator loss. This implies that the values Xm° and 3m,m° are a bit larger in the case 
of a power loss function. 

Theorem 2.3. Let the SmA procedure (2.23) be applied with the critical values im,m° 
from (2.22), where the values Xm are defined by (2.20) with the coefficients am satisfying 

^ ^ OlmPm ^ 0:m*Pm* (2.24) 

mSM+(m*) 

for some am* ■ If the errors £i are normal zero mean, then 

< 2am‘iKm‘ + (2.25) 


where 

- def 

3m* ~ max ^m* ,rn • 
mGM~(m*) 

Similarly to the probabilistic loss function, the result can be refined by considering 
the zone of insensitivity in the region m < m* . 

Now we briefly discuss the choice of constants Om entering into (2.24). Suppose that 
the Pm’s satisfy 

'y ^ (Pm*/Pm) ^ C 

meM+(m*) 


(2.26) 
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for some o > 0 and a fixed constant C . Then one can take 

— (Pm/Pmo) 

Below we focus on a situation when the effective dimension grows exponentially with 
m. Note that this situation is typical in model selection and often one can reduce the 
general case to this one by a proper discretization. Then (2.26) is fulfilled for any o > 0 
with C = C (a) . 

The further step is an upper bound on the values x^,, Zm,m°{^m), and }rn,m° , as well 
as on the payment for adaptation 3 ^* . These bounds require some exponential moment 
conditions on the errors e*. To reduce the computational burden, we again focus on the 
case of Gaussian errors. 

Proposition 2.4. Suppose (2.26) for a > 0. If the errors £i are normal zero mean, 
then the choice 

— V^(pm/Pmo) ) — 2(1 + a) log(pm/pm(,), (2.27) 

ensures conditions (2.24), (2.20), and therefore, the oracle bound (2.25) with am* = 
\/3C (pmo /Pm*) ■ Furthermore, 

Im* < + \/2\m*{2{l + a) log(p.m*/pmo) + log(|M|)}. (2.28) 

2.7 Application to projection estimation 

An important feature of the obtained oracle statements is their universality: they equally 
apply to various setups and problems and provide some quantitative explicit error bounds 
even for finite samples. Below we briefly comment on two popular cases of projection 
estimation and estimation of a linear functional. In some sense, these are two extreme 
cases of relation between pm* and Am* • 

This section discusses the case of projection estimation in the linear model Y = 
ipT Q* _j_ g. homogeneous errors e*: Var(ei) = . All the conclusions can be easily 

extended to heterogeneous errors whose variances are contained in some hxed interval. 
We also focus on probabilistic loss, the case of polynomial loss can be considered in the 
same way. 

Let us assume an ordering on the features of Fm and let for each m G N denote Fm 
as the submatrix F corresponding to first m features, i. e. the projector onto the first 
m features. We use m to denote the model and the number of features. The related 
estimator Om is the standard LSE with Sm = {^mFm) ^Fm and the prediction problem 
with W = F~^ yields XmY = IlmY where Ilm = F^(FmF^) ^Fm is the projector 
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onto the corresponding feature subspace. For homogeneous errors e* with Var(ej) = cr^ , 
the variance Ym = Yaii^UmY) satishes 

Pm = tr{Var(77ml^)} = cr^tr(77m) = cr'^m. 

Moreover, for each pair m > m° , it holds 

(^m — (^m° ) = — ilm° ) ^ = nm,m° Y , 

where njn,m° projects on the subspace of features entering in m but not in m° . 

Corollary 2.5. Consider the problem of projection estimation with homogeneous Gaus¬ 
sian errors Si and probabilistic loss. Then Pm,m° = (T^(m — m°), \m,m° = ? o,nd 

lm,m° < ct (1 -|- /3)\/m — m° -|- cr^/ 2 x -f 21 og(|M|), 

Jm* < ct(1 -I- (3)'/^ + cr\/2x -I- 21og(|M|). 

The first term in the expression for 3 ^* is of order \/m* and it is a leading one 
provided that the effective dimension m* is essentially larger than log(|M|). Usually 
the cardinality of the set M is only logarithmic in the sample size n ; cf. Lepski (1991); 
Lepski et al. (1997). Then log(|M|) Ri loglogn and 3m* ~ cry/m* for m* ^ loglogn. 
For the oracle risk IRm* , it holds IRm* = Pm* + \\bm* IP > a'^m* . Therefore, the payment 
for adaptation 3 ^, is of the same order as the square root of the oracle risk, and the 
result of Proposition 2.2 has a surprising corollary: rate adaptive estimation is possible 
if the oracle dimension m* is significantly larger than log log n . 

Remark 2.3. The payment for adaptation can be drastically reduced in the situations 
with a narrow zone of insensitivity. If the bias grows rapidly when m decreases from 
m* to mo, more precisely, if ||l)m*,m|P > Ca^(rn* — m-|-2x-|-2 log(|M|)) for some fixed 
constant C and all m < m° with m° < m* , then 

3m* < cr(l -I- l3)'fm* -m° + crv^2x -f 21og(|M|). 

So, if {m* — m°)/m* is small, the payment for adaptation is smaller in order than the 
oracle risk, and the procedure is sharp adaptive. In particular, one can easily see that 
the self-similarity condition of Gine and Nickl (2010) ensures a rapid growth of the bias 
when the index m becomes smaller than m* . This in turn yields a narrow zone of 
insensitivity and hence, a sharp adaptive estimation. 

Remark 2.4. It is worth mentioning the relation of the proposed procedure to the 
popular Akaike (AIC) criterion. AIC defines fh by minimizing 

m = argminjllK — Tlm^lP + 2 cT^m}. 
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One can easily see that this rule is equivalent to the SmA rule ( 2 . 12 ) with = 

2a‘^{m — m°). However, this choice does not guarantee the propagation condition (2.13). 


2.8 Linear functional estimation 

In this section, we discuss the problem of linear functional estimation. As previously, we 
assume a family of estimators (j)m = , m G M, to be given, where the rank of each 

%m is equal to 1. The ordering condition means that these estimators are ordered by 
their variance: 

v^ Yav{XmY) = %m Var(£) (2.29) 

grows with m. Further, each stochastic component ^m,m° =Xm,m°^ is one-dimensional, 
and it holds 


= Pn 


.0 = V, 


m,m° 


= Xm,m° Var(e) X 


T 


Note that in the case of Gaussian errors, im,m° is also Gaussian: ~ ^(0) v^m°) ■ 

The tail function Zm,m°{'Y) of ^m,m° can be upper-bounded by Vm,m°V^- In the case 
of probabilistic loss, a Bonferroni correction and a bias adjustment lead to the upper 
bound for the critical values 3 m,m° : 

3m,< Vm,m° (/S + \/2x-F 21og(|M|)^, (2.30) 

where |M| is the number of elements in M. This implies 

3m* < Vm*(/3-F V2x-F21og(|M|)^. (2.31) 

Theorem 2.6. Let the errors £{ he Gaussian zero mean. Consider a problem of linear 
functional estimation of cf* = Xf* by a given family 4>m = XmY with rank(Xm) = 
rank(X) = 1, m G M. Then the critical values 3 m,m° from (2.11) fulfill (2.30) and the 
oracle inequality (2.17) holds with the payment for adaptation 3 ^, obeying (2.31). 

Remark 2.5. One can conclude that for the problem of functional estimation with 
probabilistic loss, the squared payment for adaptation 3 ^* is by factor log(|M|) larger 
than the oracle variance v^* . If |M| itself is logarithmic in the sample size n, we end 
up with the extra (log log n)-factor in the accuracy of adaptive estimation. 

In the case of polynomial loss, similar arguments yield due to (2.22) and (2.27) 


3m,m° — ^m,m° (/3 + \/2 + 21 og(|M|)) 


< Vm,m° (/3 -h \/2(l a)log(pm°/Pmo) + 21og(|M|)) 
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Spokoiny and Vial (2009) showed that the bound 3 ^ > C (m° — mo) is necessary 

to ensure a propagation condition for geometrically growing variance pm = . The 

bound (2.30) yields 

3m* < Vm* (/3 + y^2(l -h a) log(v^*/v^J 21og(|M|)). (2.32) 

Theorem 2.7. Suppose that the errors £i are Gaussian zero mean. Let the family of 
functional estimators %mY be such that the variances pm = from (2.29) fulfill the 
condition (2.26) with a > 0 . Then the critical values irn,m° from (2.22) for the SmA 
procedure fulfill (2.30). For the resulting selector in, the oracle inequality (2.25) holds 
and the payment for adaptation 3 ^* follows (2.32). 

Remark 2.6. It appears that polynomial loss yields a larger price for adaptation: 3 ^* x 
log(v^*/v^jj). This conclusion is consistent with the results by Lepski (1992) and 
Cai and Low (2003, 2005) which show that the log-price for adaptation cannot be avoided 
if a polynomial loss function is considered. Our result seems to be even more informative 
because it delivers a non-asymptotic error bound which adapts to the underlying unknown 
model. 

3 Bootstrap tuning 

This section explains how the proposed SmA procedure can be applied if no information 
about the noise e = Y — lEY is available. 

3.1 Presmoothing and wild boostrap 

Let the observed data Y follow the model Y = f* + e with e ~ 11^(0, A), where 
A = diag(fTf,... , 0 "^) is an unknown diagonal covariance matrix. We assume that the 
response vector f* can be well approximated by a linear expansion for a given basis \F 
in the form f* ss 6* . The vector 6* can be naturally treated as target of estimation. 
Assume we are given the ordered family of the estimators (dm) of 0* : 

dm = SmY = {FmFZy^'FmY , m G M. 

For each pair m > m° from M, we consider the test statistic Tm,m° and its decompo¬ 
sition from (2.5): with %m,m° = W{Sm — <Sm°) 

H'm,m° — ||lVm,m°'V|| = \\0Cm,m°{f T£)|| = ||bm,m° 

Calibration of the SmA model selection procedure requires to know the joint distribution 
of all corresponding stochastic terms m° II for m > m° which is uniquely determined 



SPOKOINY, V.AND WILLRICH, N. 


21 


by the noise covariance matrix U . In the case when this matrix is unknown, we are going 
to use a bootstrapping procedure to approximate this distribution. 

The proposed procedure relates to the concept of the wild bootstrap, Wu (1986), 
Beran (1986). In the framework of a regression problem, it suggests to model the unknown 
heteroscedastic noise using randomly weighted residuals from pilot estimation. We apply 
normal weights. For other possible bootstrap weights see for example Mammen (1993). 

Suppose we are given a pilot estimator (presmoothing) / of the response vector 
f* G iR” . Define the residuals: 

Y = Y — f 

This pilot is supposed to undersmooth, that is, the bias is negligible and the variance of 
Y is close to Y . This pre-smoothing requires some minimal smoothness of the regression 
function, and this condition seems to be unavoidable if no information about the noise 
is given: otherwise one cannot distinguish between signal and noise. Below we suppose 
that / is a linear predictor, f = UY , where 77 is a sub-projector in the space TR” . 
For example, one can take 77 = (<7^t where m) is a large model, e.g. 

the largest model M in our collection. 

The wild bootstrap proposes to resample from the heteroscedastic Gaussian noise 
7P^ = 17(0, T) with 


Y = diag(l^ • y), 

where Y Y denotes the coordinate-wise product of the vector Y with itself and diag(l^- 
Y) denotes the diagonal matrix with entries from Y Y. These entries depend on Y 
and thus are random. Therefore, the bootstrap distribution TP^ is a random measure 
on 7R” and the aim of our study is to show that this random measure mimics well the 
underlying data distribution for typical realizations of Y . Clearly diag(l^ -Y) is a very 
bad estimator of the covariance matrix Y . However, below we show that under realistic 
conditions on the pilot / and on the model, it does a good job and allows to obtain 
essentially the same results as in the case of known Y. 

Let denote the n-vector of bootstrap weights ~iN’(0,7„). Clearly the product 
= diag(l^)i(;^ is conditionally on Y normal, 

= diag(l")tu*' | y ~ ^(0, Y). 

Bootstrap analog of ^rnm° — reads dLiag{Y)w^ and 

lld,m°ll = \\%m,mo dmg{Y)w'> 


(3.1) 


22 


Bootstrap-tuned model selection 


The idea is to calibrate the SmA procedure under the bootstrap measure IP^ using 
11^1^ II in place of ||4mm°ll • The bootstrap quantiles are given by analog of 

(2.9): 


IIC,r„°ll 




= e 


—t 


The multiplicity correction = q^o (x) is specified by the condition 


(3.2) 


U {lid,moll ><^o(x + gio)} 

\mSM+(mo) / 


= e 


Finally, the bootstrap critical values are fixed by the analog of (2.11): 

im,m° ^m,m° T ^m° ) T "'JVm,m° 

for =iF^||d,m°lP given by 

Pm°,m trK°,m diag(y • r) 


(3.3) 


Recall that all these quantities are data-driven and depend upon the original data. Now 
we apply the SmA procedure with the critical values defined in such a way. Our 

main result claims that this choice still ensures the propagation condition (2.10) and 
therefore, all the obtained results including the oracle bounds, apply for this choice as 
well; see Theorem 3.2. Moreover, we evaluate the distance between the unknown under¬ 
lying data distribution IP and the bootstrap distribution IP^. The latter is random, 
however, we show that with high probability, it is close to its deterministic counterpart 
]P. To make the results transparent and concise we assume a heterogeneous Gaus¬ 
sian noise e. All the statements can be extended to a non-Gaussian noise under some 
exponential moment conditions at the cost of many technical details. 

Let Q denote the joint distribution of all stochastic vectors entering in the 

decomposition of the test statistics Tm,m° for m > m° . Let also Q'’ be the similar 
distribution of the bootstrapized stochastic vectors entering in the test statistics 

. The next result allows to upper bound the total variation distance between Q 
and Q*' in terms of the following quantities: 

Design Regularity is measured by the value 5^ 

n 

(ii 2 / max \\S~^^‘^'Pi\\ai, where S 'Pi'pJaf ; (3-4) 

i=l 

Obviously 

n n 

^ = tr(^ af) = tr 7^ = p, 

i=l i=\ 
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and therefore in typical situations the value is of order ^fpjn . 
Presmoothing bias for a projector U is described by the vector 


B = - nf*). (3.5) 

We will use the sup-norm ||S||oo = niaxj |6j| and the squared .^ 2 -norm ||S|p = 
^ • bf to measure the bias after presmoothing. 

Stochastic noise after presmoothing is described via the covariance matrix Var(£) 
of the smoothed noise s = — He). Namely, this matrix is assumed to be 

sufficiently close to the unit matrix In , in particular, its diagonal elements should 
be close to one. This is measured by the operator norm of Var(e) — and by 
deviations of the individual variances lE^ from one: 

<5i II Var(e) - /n||op, 

. def _.2 

0 ^ = max|ic/6^ ~ 1|* 
i 

In particular, in the case of homogeneous errors B = a'^In and the smoothing 
operator il as a p-dimensional projector, it holds 

Var(£) = {In -nf =In-n <In, 

~ II Var(£) fn||op — ll-^llop — 1) 

5s = max|iEie? — 1| = max|i7jj|. 
i i 

One can check that Tin x yfpjn for typical smoothing operators like local average 
or kernel smoothing. Similar bounds with an additional constant can be established 
for general regular noise e and a general smoothing operator U. 

Regularity of the smoothing operator 77 is required in Theorem 3.2. This condi¬ 
tion will be expressed via the norm of the rows Tj of the matrix T 
fulfill 


112)^11 z = l,...,n. (3.7) 

This condition is in fact very close to the design regularity condition (3.4). To 
see this, consider the case of a homogeneous noise with B = a^In and 77 = 
i 7^(!7<7^) ^<7. Then T = U and (3.4) implies 

||7;^|| = II!7T(i7!7^) "^17^11 = ||(<7'7^)“^/Vi|| < 5^p . 




24 


Bootstrap-tuned model selection 


In general one can expect that (3.7) is fulfilled with some other constant which 
however, is of the same magnitude as 5\p . For simplicity, we use the same symbol. 

3.2 Bootstrap validation. Range of applicability 

This section states the main results justifying the proposed bootstrap procedure. They 
claim that the joint distribution of the bootstrap stochastic components 
m > m° nicely reproduces the underlying distribution Q of the ’s, and hence, all 

the probabilistic results obtained in Section 2 for known noise continue to apply after 
bootstrap parameter tuning. In the next result, we give a bound on the total variation 
distance ||Q —Q^||tv between Q and Q*'. 

Theorem 3.1. Let Y = f* + e he a Gaussian vector in IR^ with independent com¬ 
ponents, Y ~ for Y = diag(crf,..., cr^) . Let also 'L be a p x n feature 

matrix such that the p x p-matrix S = L' is non-degenerated. For a given pres¬ 

moothing operator LI : IR'^ —)• RF, assume that di from (3.6) satisfies (5i < 1. Let 
, m, m° G M) and let be the joint conditional distribution of the boot¬ 
strap stochastic terms for m, m° G M given the data Y . Then it holds on a 

random set f 22 (x) with JP(l72(x)) > 1 — 3e“^ .• 

||Q-Q^||tv < 2^2(x), (3.8) 

Z\2(x) 2^6lpXn + ^/^ + y/\\B\\^p + 461 \\B\\ (l + \/x). 

where x„ = x -|- log(n), the bias B is given by (3.5) and , 5s by (3.6). 

The result (3.8) gives us a way to control differences Q(^) — Q^(^) for fixed sets 
A. To justify the propagation property for the bootstrap-based set of critical values 
z^^o{x -|- q^o ), given according to (3.1), (3.2), and (3.3) with Y = Y — IJY , we also 
need to take into account the Y -dependence of (x -|- ). This is done by the 

following theorem. 

Theorem 3.2. Assume the conditions of Theorem 3.1, and let the rows Tj of the matrix 
T Y~"^I‘^TIY^I'^ satisfy (3.7). Then for each m° G M 

- im°(x + gio)} > O) < fie"’^ ^ ZIq( x), (3.9) 

where with x„ = x -|- log(n) and Xp = x -|- log(2p) 

4lo(x) = \\B\\‘^ A 5^\\B\\^y2^-\-25\]/Xji-\-5'^Xn A 25\]/yf^ A 25^Xp. 
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The SmA procedure also involves the values Pm,m° ; which are unknown and depend 
on the noise e. The next result shows the bootstrap counterparts can be well 

used in place of pm,m° • 

Theorem 3.3. Assume the conditions of Theorem 3.1. Then it holds on a set f2i(x) 
with x)) > 1 — 3e ^ for all pairs m < m° G M 


where 


Pm,m° 


- 1 


Pm,r. 


< ^p, 


A —^ llBlP + 4x^^^ IIBII + 4x^^^ (5 

XJp — 11-D 11 oo 11 11 'iXjyj On 


m,,m' 


o ; Pm,m° = , and X^ 


+ 4x]vt + de, 

= X + 21og(|M|) . 


The above results immediately imply all the oracle bounds for probabilistic loss of 
Section 2 with the obvious correction of the error terms. 


Now we discuss the sense of the required conditions for bootstrap validity. Our results 
are only meaningful and the bootstrap approximation is accurate if the values Z \2 (x) and 
y^Ao(x) are small. One easily gets 

^2(x) X VP^o(x) < +Sl\\B\\ +S^ + 6e), 


where C is a generic notation for absolute constants and log-terms like x„,Xp etc. So, 
keeping the errors of bootstrap approximation small requires that the values 5^p, S‘^p, 
||B||^p, and 6^ ||S|| are sufficiently small. Now we spell this condition in the typical 
situation with 6,r, x sjpjn and Jg x s/pjn . Then we need that p^log(n)/n is small. 
Further, the bias component does not destroy the bootstrap validity result if the values 
||S||^p and pn~^\\B II < pn ^/^||S||oo are small. If f* is Holder-smooth with the 
parameter s: 


||S||oo < Cp-^ (3.10) 

then the bootstrap procedure is justified for s > 1/4 if p = —)■ oo but p‘^/n ^ 0 as 

n —)> oo . We state one asymptotic result of this sort. 

Corollary 3.4. Assume the conditions of Theorem 3.2 and let p = Pn fulfill p^ log(n)/n —)■ 
0 as n —)■ oo, and (3.10) hold for s > 1/4. Then the results of Theorem 3.1 and 3.2 
apply with a small value An = [^/pn AQ{xn)) V Z\ 2 (x„) —)• 0 as n —>■ 0 . 


4 Simulations 

This section illustrates the performance of the proposed procedure by means of simulated 
examples. We consider a regression problem for an unknown univariate function on [0,1] 





26 


Bootstrap-tuned model selection 



Observed 
Oracle-Est. (16) 
SmA-BS-Est. (14) 
SmA-Est. (14) 



Observed 
Oracle-Est. (13) 
SmA-BS-Est. (11) 
SmA-Est. (11) 


0.00 0.25 



Figure 4.1: True functions and observed values plotted with oracle estimator, the known- 
variance SmA-Estimator (SmA-Est.) and the Bootstrap-SmA-Estimator (SmA-BS-Est.) 
for 3 different functions with different noise structure going from low noise to high noise. 
The numbers in parentheses indicate the chosen model dimension. 


with unknown inhomogeneous noise. The aim is to compare the bootstrap-calibrated 
procedure with the SmA procedure for the known noise and with the oracle estimator. 
We also check the sensitivity of the method to the choice of the presmoothing parameter 
. 

We use a uniform design on [0,1] and the Fourier basis for approximation 

of the regression function / which is modelled in the form 


f{x) = ci'il;i{x) ... -I- Cp-ippix), 


where the (cj)i<j<p are chosen randomly: with •jj i.i.d. standard normal 


7,/(j - 10)2, 11 < i < 200. 


The noise intensity grows from low to high as x increases to one. We use Usim-bs = 
?^sim-theo = ^sim-caiib = 1000 Samples for Computing the bootstrap marginal quantiles and 
the theoretical quantiles and for checking the calibration condition. The maximal model 
dimension is M = 37 and we also choose = 20. The calibration is run with x = 2 
and /? = 1. 

We start by considering examples for W = , i.e. the estimation of the whole func¬ 

tion vector with prediction loss. One can see in Figure 4.1 three examples with different 
intensity of the noise term comparing the Bootstrap-method to the oracle estimator and 
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the known-variance SmA-Method. Figure 4.2 illustrates the dependence of the choice of 
the estimated dimension on our calibration dimension m) and the sample size n . We see 
that in the specific example we are considering, the sensitivity of the chosen dimension 
m on decreases very fast. In the case n = 200 , we have no variation in the choice 
of m with respect to m) . The oracles are respectively m* = 12 for n = 100, 200 and 
m* = 10 for n = 50. We also want to compare the true quantiles and their bootstrap 



Figure 4.2: The first three plots show an exemplary function with n = 50,100, 200 
observations. The right plot shows the fh chosen by the Bootstrap-SmA-Method as a 
function of the calibration dimension and the number of observations. 

substitute. Figure 4.3 plots the ratios of quantiles for all possible comparisons (mi, m 2 ) 
for the same function as before. Here we see that there is, as one would expect, still sig¬ 
nificant variation in the quantile ratios for small differences |mi — m 2 | • Nonetheless the 
method works very well as seen in Fig. 4.2, but the variability in the ratios implies the 
possibility to stabilize the procedure even more by introducing some smoothing scheme 
for the quantiles. 

Figure 4.4 again demonstrates the dependence of the ratios on m^. It is remarkable 
that the ratio is varying very slowly above m* = 12. We also give the results on the 
simulation of nhist = 100 repeated applications of the method to the same true underlying 
function observed with different realizations of the errors in Figure 4.5. 

The case of the estimation of the first derivative is similar. Figure 4.6 shows the 
numerical results for estimation of the derivative in the same model as above. One 
can see that the bootstrap-version of the SmA-procedure is again competitive with the 
procedure based on a known noise structure and the method does a good job of mimicking 
the oracle. 

One can conclude that the proposed procedure is really universal and demonstrates 
a very good performance in various settings. 
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Figure 4.3: Ratio of quantiles m2/3mi,m2p = 20 and n = 200 with the data 

and true function as in Fig. 4.2. 


o 



• Max. ratio 

• Mean ratio 

• Min. ratio 


Figure 4.4: Maximal, minimal and mean ratio of the bootstrap and theoretical tail 
functions at x = 2 , a function of . 



Observed 
Oracle-Est. (12) 
SmA-BS-Est. (12) 
SmA-Est. (11) 


Chosen model n 


Figure 4.5: In the left plot, the true function and observed values are plotted for one 
realization together with the oracle estimator, the known-variance SmA-Estimator (SmA- 
Est.) and the Bootstrap-SmA-Estimator (SmA-BS-Est.). The numbers in parentheses 
indicate the chosen model dimension. In the right plot, histograms for the selected model 
are given for the bootstrap (BS) and the known-variance method (MC) for repeated 
observations of the same underlying function with a simulation size nhist = 100. 
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— Oracle-Est. (13) 

— SmA-BS-Est. (11) 
SmA-Est. (11) 

I—jTrue derivative 



Observed 
!-•- True function 



X 


Figure 4.6: The upper left plot shows the true derivative, the oracle estimator, the known- 
variance SmA-Estimator (SmA-Est.) and the Bootstrap-SmA-Estimator (SmA-BS-Est.). 
The upper right plot shows the true function and the observations and in the lower plot 
one can find the standard deviation of the errors. 

A Proofs 

The appendix collects the proofs of announced results. 

A.l Proof of Theorem 2.1 

The propagation property (2.13) claims that the oracle model m* will be accepted with 
high probability. This yields that the selected model is not larger than m* , that is, 
in < m* with a probability at least 1 — e”’'. Below we consider only this event. Let 
m G M“(m*). Acceptance of m requires in particular that Tm*,m < The 

representation Tm*,m = \\bm*,m + im*,m\\ implies 

^ m* ,m ^ im* ^ ^ im* ,m) • 

Under (2.14) this yields 

lP{m is accepted) < lP{\\bm*,m + ^m*,m\\ < 3m*,m) 

< -^(||^m*,m|| > ^m*,m(Xs)) < e"^L (A.l) 

If the lower bound on the bias is fulfilled for all m G , then (A.l) helps to bound the 
probability of the event {in G : 

]P{m G M'^) < + ^m*,m|| < 3m*,m) < 6 < e 
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Therefore, the probability that the SmA-selector picks up a value m > m* or m £ 
is very small: 

]p(mG M+(m*) U < 2e"^. 

It remains to study the case when rh = m £ !M° = {m *) \ . We can use that fh 

is accepted, which implies by definition 

Ifm* ,m — II 0 m 0 m* 11 — hm* ,m • 

This yields (2.15). The bound (2.17) now follows by the triangle inequality. 

A.2 Proof of Proposition 2.2 

Below we use the deviation bound (D.2) for a Gaussian quadratic form from Theorem D.l. 
Note that similar results are available for non-Gaussian quadratic forms under exponen¬ 
tial moment conditions; see e.g. Spokoiny (2012). The result (D.2) combined with the 
Bonferroni correction qm° = log(|M'''(m°)|) < log(|M|) yields the following upper bound 
for the critical values im,m° '■ 

dm,m° ^ ^m,m° (x T Qm°') T PPrn,m° 

< (1 + /3)VPm,m° + y 2 Am,m° {x + log(|M+(m°)|)} 

< (1 -|- /3)-0Pm,m° + '^2Am,m° {x + log(|M|)}. (A.2) 

For the payment for adaptation 3 ^* , the result (A.2) and the monotonicity condition 
Pm*,m < Pm*,mo < Pm* and Am*,m < Am*,mo < Am* imply the following upper bound: 

3m* ^ (1 + /d)\/Pm*,mo + m* ,mo {x-plog(|M (m*)|)} 

< (1 -p/3)y^p))^-p \/ 2 Am 00 r+log([M|jy 
which yields the claim. 

A.3 Proof of Theorem 2.3 

The result will be proved in two steps. First we bound the risk on the set fh > m* : 

iF{ 110 — 0* 11^ ]l(m > m*)} < 2am*ikm* • (A.3) 

Then we consider the region fh < m* and prove an oracle inequality 
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and the oracle bound (2.25). We start by proving (A.3). Let us fix m G and 

m' > m. The dehnition (2.18) of the oracle m* and the formula (2.22) for the critical 
value implies for the test statistic 




Now we can bound the risk of (f) on the set fh > m* . We use that for m = m > m* in 
view of (2.19) 


110 = || 0 ^- 0*||2 = ||^^ + 6^||2 

< 2UJ\^ + 2\\bmf <2Umf + 2\\bm*f 


and it holds by (2.21) and monotonicity pm > Pm* 

iE;{|| 0 - ]l(m > m*)} 

<2 lE{{\\^^f+ \\bm*f)Hm = m)} 


mgM+(m*) 


<2 Y + \\bm*f) ]I(m(_i) is rejected)} 

mgM+(m*) 

= 2 Y ^ (ll^mf + ll^m*f)]l( max {ll^ II 

^ ^ Vm'gM+(m) 1 ’ ( u 

mgM+(m*) 

< 2 ^ ^ CKm(Pm T 10m* || ) ^ ^CXm* (Pm* T |0m* || ) ~ 2(3:^*“^. 


■2'm',m(_x) 


(xm)| > 0^ 


mgM+(m*) 

Here we have used that (2.24) and pm > Pm* imply Z]mgM+(m*)“m - “m* • This 
completes the proof of (A.3). 

In the situation when fh = m < m* , we can use the stability property: as m is 
accepted, it holds 


— cj)^* II ]I(m = m) < 3 




which implies (A.4) by definition of 3 ^* . This yields 

lE\\4>-(l)*f < 2am*0lm* + 1E{\\^ - Cf)*f -K{fh < m*)] 

< 2am* fkm* + (11 4*m* ~ 0* 11 + 3m* ) 

< 2am* ^m* + (^m* T3m*) 


as required. 
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A.4 Proof of Proposition 2.4 

Observe first that the choice am = (Pm/Pmo)~^~“ yields 


OmPm < Vmo Pm — ^Pm* Prr^j ~ Cam*Pm* 

mSM+(m*) mSM+(m*) 

with am* = C(pmo/Pm*)^'^“ • 

For any random vector ^ with Var(^) = B and p = tr{B) and any random event 
A, it holds 





< {l + p-2Var(||^||2)}'/VV2(^). 


(A.5) 


Indeed, the Cauchy-Schwartz inequality implies 

= {l + p-2Var(||^||2)}'/VV2(yl). 


Moreover, in the Gaussian case ^ ~ lsf(0,i?) with ||-B||op < 1, h holds Var(||^|p) < 2p. 
If p is large then Var(||4|p)/p^ is small. In general Var(||^|p)/p^ < 2. 

Result (A.5) and the choice am = \/3Pm^~“ allow to specify an upper bound on Xm ■ 
Namely, the choice Xm = C log(pm) ensures the propagation condition (2.20). To see 
this, hx m and m! >m. Let 

^m(x) il( max {no,mil - “ a/ 2A^/,^ {x + log(|M|)}} > o) 

The arguments after Lemma D.l with = 2(1 -l-a)log(pm) and (A.5) imply 


]E 




and by (2.22) 


3m,^ -\/Pm,m° + 

m,m° {(1 + a) log(pm°-Ll) + log(|M|)} . 


This implies the upper bound (2.28) on the payment for adaptation im* ■ 


A.5 Proof of Theorem 3.1 

Any statement on the use of bootstrap-tuned parameters faces the same fundamental 
problem: the bootstrap distribution is random and depends on the underlying sample. 
When we use such values for the original procedure, we have to account for this depen¬ 
dence. The statement of Theorem 3.1 is even more involved due to the presmoothing 
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step and multiplicity correction (3.3). The proof will be split into a couple of steps. First 
we evaluate the effect of the presmoothing bias and variance and reduce the study to an 
artificial situation where one uses the errors Si for resampling in place of the residuals 
Yi. Then we compare Q and Q ^ using the Pinsker inequality. 

Below we write iF in place of ; where M is the largest model in the collection. 
This does not conflict with our general setup, it is implicitly assumed that the largest 
model coincides with the original one. By p we denote the corresponding parameter 
dimension, that is, is a p x n matrix. Further, the feature matrix can be written 
as the product j where ilm is the projector on the subspace of the feature 

space spanned by the features from the model m : Um = iF^(lFm'F^) ^'Fm . This allows 
to represent each estimator in the form 

0 ^ = WOm = WSmY = = 7m^Y 

7m fF(<F™lF^)“^77^ . 


This implies the following representation of the stochastic components ^m m° ■ 




= T, 


m,m° 


•Fe = T, 




V, 


rr rp rr 


where V = . One can say that each stochastic vector is a linear function of 

the vector V. A similar representation holds true in the bootstrap world: 

= 7m,m^^ diagiY)w^ = 7m,moV\ dmg{Y)w\ 

Here the original errors e are replaced by their bootstrap surrogates = diag(l^)m*'. 
Therefore, it suffices to compare the distribution of V = iFe with the conditional distri¬ 
bution of V*' = iFdiag(l^)m^ given Y . Then the results will be automatically extended 
to any deterministic mapping of these two vectors. 

Normality of the errors e* ~ N(0, cr?) implies that V = !Fe is also normal zero mean: 

V~N(0,5), 5=^lFNlF^, = Var(e) = diag((Tf,..., cr^). 

Similarly we can use standard normality of the bootstrap weights . Given the data 
Y , the vector V ^ is conditionally normal zero mean with the conditional variance 

Var^(V‘') = 'Fdiag(U^...,F;2)lF^ = •Fdiag(y • Y)'F^. 


Therefore, the problem is reduced to comparing two p-dimensional Gaussian distri¬ 
butions with different covariance matrices. Equivalently, we have to bound the value 
A = ytr('B2) for a random p x p matrix B given by 
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Define a pxn matrix U = S so that UU^ = Ip. We will use the decomposition 

- HY) = - He) + - Hf) = r] + B 


with 

rj - Tie), B - Hf*). 

With the matrix T3 can now be represented as 

IB = U diag{{ri + B) ■ {ri + B) - Inp^ 

= U diag{(T7 + B) ■ {-q + B) - r] ■ q}U^ “Bi 

+ Udiag{q ■ q — lE{q ■ q)}U~^ B 2 

+ 11 dmg{lE{q ■ q) - Inp'^ = 


(A.6) 


(A.7) 


The first term Bj in this decomposition expresses the impact of the bias B remaining 
after presmoothing, the last two terms 232 and 233 measure the change of the noise 
covariance due to presmoothing. The triangle inequality in the Frobenius norm ||23||Fr 
pti(Bp and bounds from Propositions E.6, E.7, and E.8 with = Ip and q = q 2 = 
tr(TTTT''') =p imply on a random set fT 2 (x) = f7i2(x)Uf722(x) with TP(f72(x)) > 1 —2e“^ 

||23||Fr < ||23i||Fr + ||232||Fr + ll'BsllPr 
< /\l(x) A 2 (x) -h Z\ 3 (x) 

= 2y^(5|p(x-Mog(n)) + \A|p + \/||S||^p-b4<5| ||S|| (l y/I). 

This proves (3.8) in view of Pinsker’s Lemma E.l with b = = 0. 


A.6 Proof of Theorem 3.2 

The result of Theorem 3.1 justifies the bootstrap-phenomenon, namely it explains why the 
known bootstrap distribution can be used as a proxy for the unknown error distribution. 
However, it cannot be applied directly to (3.9) because the quantities 
are random and depend on the original data. This especially concerns the multiplicity 
correction q^o which is based on the joint distribution of the vectors from (3.1) 

and is defined in (3.3). The latter distribution is a random measure in the bootstrap 
world which is normal conditioned on the original sample. To cope with the problem of 
this cross-dependence, we apply the statement of Theorem B.l in the Appendix. The 
underlying idea is to use geometric arguments to sandwich the random probability in 
(3.3) in two deterministic probabilities. Then the error of bootstrap approximation can 
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again be bounded by using the Pinsker inequality. The statement of Theorem 3.2 can 
be derived from Theorem B.l if an operator norm bound ||lB||op is available. Note that 
Theorem 3.1 only requires a bound for the Frobenius norm. By Proposition E.9, it holds 
with (5„ = (5 i 2/ , x„ = X + log(n), and Xp = x + 2 log(p) 

PIlop < Pp(x), 

^op(x) = ||S||^ + 5|r ||S|| \/^ + 2(5ijrXp^^ + 2(l^Xp + 25ijrXn + (5|rXn. 

The result of the theorem follows now by Theorem B.l. 

A.7 Proof of Theorem 3.3 

For a fixed pair m > m° from M, consider Pm,m° = 

iFlI^rn m° IP • diag(T) and E are diagonal matrices, the definitions (3.1) and (A.6) 
imply 

dmg{Y)w^ = di?,g{Y)w^ 

= Um,m° diag(77 + S)mp 
where Um,m° *= ■ It holds for p^^^o 

Pm,m° diag{(r/+ S) • (r? + S)} 7/^ ) 

while $m,m° = and 

Pm,m° = ~ ^rn,rn°') ■ 

As we are interested in the ratio Pm,m° /Pm,m° , one can assume without loss of generality 
that \\h(m,m° rn°\\op = 1 ^nd Prn,m° > 1- Now we again apply the decomposition 
(A.7). The bounds (E.14) of Proposition E.6, (E.17) of Proposition E.7, and (E.19) of 
Proposition E.8 imply on a set with iP(l?m,m°(x)) > 1 — 3e“^ 

< \\B\\l^ + Ax^/H^\\B\\+ix^/^6n + ^x6^ + 5e. 

The choice of x = xjvc = x + 21og(|!M|) ensures a uniform bound for all pairs m > m° 
from M. 

B Random multiplicity correction 

Suppose that is a random positive symmetric p x p matrix close to a deterministic 
matrix V. Below we use the operator norm for quantifying the difference between V 


Pm.T 


- 1 
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and : namely let with probability one 

\\V-^/^vW-^/^-Ip\\op< ^0- (B.l) 

In what follows, IP = 3^1(0, V) is the normal measure on with mean zero and 
covariance V . Similarly iP'’ is a random measure on which is conditionally on 
normal with IP^ = 1x1(0, y*'). Suppose that for each m from a given set M a linear 
mapping Tm : —?■ is fixed. Given x, define for each m G M the corresponding 

tail function Zm(^) by 

]p{u: \\Tmu\\ > Zm{^)} = (B.2) 

Also define a set A(x) as 



Similarly define z!^{x) by (B.2) with IP^ in place of IP, m G M, and A^(x). Note 
that all these objects are random because IP^ is random. Finally, let be the random 
quantity providing 

1P\a\xI)) = 1 - a. (B.3) 

Below we try to address the question whether this random multiplicity correction based 
on (B.3) does a good job under IP . This question leads to analysis of value IP(A*'(Xq)) : 
the goal is in evaluating the difference 

JP(.4‘(i:i)) - (1 - a). 

The analysis is non-trivial because A^(x) and x^ are random. 

Theorem B.l. Let a random matrix satisfy (B.l) for a deterministic matrix V 
and Ao < 1/2 . Then it holds 

\lP{A\x^J)-l + a\<^Ao. (B.4) 

Proof. The key property of IP^ = N(0, I^'’) is that the random matrix concentrates 
around some deterministic matrix. Below we use this property in the bracketing form: 

< F*' < F+ 


with 


V- (1 - Z\o)F, F+ (1 + Ao)V, V+-V- = 2AoV. 


(B.5) 


SPOKOINY, V.AND WILLRICH, N. 


37 


In other words, the random matrix can be sandwiched in two deterministic matrices 
V~ and V~^ . For the proof of (B.4) we use the following well known property of the 
Gaussian distribution. 

Lemma B.2. Let IPi ~ ^(OjGi) and IP 2 ~ 1x1(0,!^) with Vi < V 2 . Then for any 
centrally symmetric star-shaped set A, it holds 

IPi{A) > 1P2{A). 


Proof. The statement is trivial in the univariate case, the general case is obtained by 
integration over A in polar coordinates. □ 

Introduce two Gaussian measures 1P~ = 74(0,17“) and IP'^ = 74(0,17^); see (B.5). 
Let 2 m(x) and ^^(x) be the corresponding tail functions, and ^“(x) and ^"''(x) - the 
corresponding sets. The identities (B.5) yield for each x the relation 

iP+(^+(x)) = iP“(A“(x)). (B.6) 

Lemma B.2 implies by (B.5) for any x 

IP+(A(x)) < 1P\A{x)) < ]p-{A{x)). (B.7) 


The key step of the proof is given by the next lemma where we sandwich the random 
set ^^(x^) in two specially constructed deterministic sets. 


Lemma B.3. Define the deterministic values x^ and x+ by the equations 

iP+(^“(x+)) = 1 - a, 

1P-{A+{X-)) = 1-a. 

Then 


(B.8) 


X„ < X„ < xj 


‘■Q: 

1 b ' 


A-(x-) c A\xi) c yl+(x+). 


(B.9) 


Proof. By Lemma B.2 the following inequalities and inclusions hold true for any x: 

^■(x) < zl^{x) < z+{x), 

A~{x) C A\x) C A+(x). 

Now by definition (B.8) in view of (B.7) and (B.IO) 

]P\A\xt)) > IP+(7l^(x+)) > iP+(7l-(x+)) = 1 - a, 

1P\a\x-)) < lP-{A\x-)) < iP-(7l+(x-)) = 1 - a. 


(B.IO) 
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This yields by monotonicity of ]P^{A^(x)) in x that x^ from (B.3) belongs to the 
interval [x“,x+] and 

A-{x-) c yl^(x-) c A\xi) c yl^(x+) c A+(x+). 

This implies the result. □ 

Now we are prepared to finalize the proof. The relations (B.9) and (B. 6 ) imply 
lP^{A\xi)) < iP+(A+(x+)) =iP-(A-(x+)). 

Furthermore, it holds by Pinsker’ inequality Corollary F.2 in view of (B.l) and (B. 8 ) 
]P~{A~{x^)) < JP+(A“(x+)) -h y/pAo < 1 -a + 

Similarly 

lP-{A\x}>^)) > fp-(kl-(x-)) =fP+(kl+(x-)) 

> fP“(^+(x“)) - ^Z\o = 1 - a - ^Z\o- 

This implies (B.4) for the measure IP . □ 

C Deviation bounds for Gaussian law 

This section collects some simple but useful facts about the properties of the multivariate 
standard normal distribution. Many similar results can be found in the literature, we 
present the proofs to keep the presentation self-contained. Everywhere in this section 7 
means a standard normal vector in FU’. 

Lemma C.l. Let p G (0,1) • Then for any vector X G IRP with ||A|p < p and any 
r > 0 

logiE{exp(A’^ 7 ) ]I(|| 7 || > r)} < -h ^||A|p + ^log{p-^). (C.l) 

Moreover, if > 6 p -|- 4x, then 

iE|exp(A^ 7 ) 1 I(|| 7 || < r) | > _ y 2 ) 

Proof. We use that for p < 1 


.E'{exp(A''^7) ]l(||7|| > r)} < e '^^^^/^ll'exp{A^7 -|-(1 —/i)||7|p/2}. 
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It holds 

iBexp{A’^7 +(1-^)||7||V2} = j exp{A "^7 -/i|| 7 f/ 2 }d 7 

= exp(/i“^||A|p/2) 

and (C.l) follows. 

Now we apply this result with fi = 1/2. In view of iBexp(A''' 7 ) = > 

6 p + 4x , and 2 + log(2) < 3 , it follows for || A|p < p 

e-ll^lP/ 2 jK|exp(A^ 7 ) ]l(|| 7 || < r)} 

> 1 — exp(—r^/4 + P+ {p/2) log( 2 )) > 1 — exp(—x) 
which implies (C.2). □ 

Lemma C.2. For any u G IHF, any unit vector a G , and any 2 ; > 0, it holds 

IP{Wj — u\\ > z) < exp{— 2;^/4 + p /2 + ||u|p/ 2 }, (C.3) 

iE{|7^a|2]l(||7--u|| >z)} < (2 +|u'^a|2)exp{-zV4+p/2 +ll'uf/2}. (C.4) 
Proof. By the exponential Chebyshev inequality, for any A < 1 

iP (||7 — u|| > z) < exp( —A 2 :^/ 2 )lEexp(A ||7 — u\\‘^/2) 

= - 5’“‘5(1 - ■') + 

In particular, with A = 1/2, this implies (C.3). Further, for ||a|| = 1 

iE{|7^ap 2(117-^11 > z)] < exp(-zV4)iF{|7^a|2exp(||7 - u||2/4)} 

< (2 + |u'''a|2) exp(—2;2/4 + p/2 + ||ri||2/2) 

and (C.4) follows. □ 

D Deviation bounds for Gaussian quadratic forms 

This section collects some deviation bounds for Gaussian quadratic forms. The next 
result explains the concentration effect of B'^ for a standard Gaussian vector 7 and 
a symmetric matrix B. We use a version from Laurent and Massart (2000). 

Theorem D.l. Let 7 6 e a standard normal Gaussian vector and B be symmetric 
positive. Then with p = tr(B), = tr(B^), and A = ||i?||op , it holds for each x > 0 


lP{'j~^ Bj > P + 2vx^/2 _|_ 2 Ax) < e 


(D.l) 
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This implies for any positive B 

lP{\\B^/‘^-i\\ > p^/2 ^ (2 Ax) 1/2^ < e■^ 


Also 


lP(y~^B^ < p — 2vx^/^) < e 


(D.2) 


(D.3) 


If B is symmetric but non necessarily positive then 

B'j — p| > 2vx^/^ -|- 2Ax) < 2e“^. 


Proof. Normalisation by A reduces the statement to the case with A = 1. Further, the 
standard rotating arguments allow to reduce the Gaussian quadratic form ||7|p to the 
chi-squared form: 

j'^Bj = 

i=i 

with independent standard normal r.v.’s uj . Here Xj G [0,1] are eigenvalues of B , and 
p = Ai Ap , = Af Ap . One can easily compute the exponential moment 

of — p)/2 : for each positive jj, < 1 

1 ^ 

logiEexp{^(7’^H7 - p)/2} = - - log(l - tiXj)}. (D.4) 

i=i 


Lemma D.2. Let fiXj < 1 and Aj < 1. Then 


1 ^ 

9 - hXj)} < 


i=i 


4(1-m)‘ 


Proof. In view of fiXj < 1, it holds for every j 


/ \ \k 

-fvXj - log(l - p.Xj) = ^ ^ ^ 


k=2 


2 - 2(1-mA,) - 2(1-m)’ 


(D.5) 


and thus 


' - log(l - ^A,)} < ^ ALL_ < rv- 


J = 1 


^4(1-^) -4(l-/i)- 


□ 
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The next technical lemma is helpful. 

Lemma D.3. For each v > 0 and x > 0, it holds 

inf|-^(vxi/2 + x)+-i^j < -X. 

n>o{ ’ 4(1-^)] 


Proof. Let pick up 


1 _ xV2 

2xV2/v + 1 “ xV2 + v/2’ 


so that fi/{l —/i) = 2x^/^/v. Then 


-//(vx^/2 _^x) + 


2„2 


H V 




= -/x(vx^/^ + X +vV4) + 


/iV 


4(1-/.) 


.1/2 


.1/2 + v/2 


(x1/2 + v/2)^ + 


2x1/2. 


= —X 


and the result follows. □ 

Now we apply the Markov inequality 

loglP( 7 '''il 7 > p + 2 vx ^/2 + 2 x) = logiP(( 7 ^i ?7 — p )/2 > vx ^/2 + x) 

< inf/.(vx^/^ + x) + log. 2 ?exp{/.( 7"'^57 — p)/ 2 }| 

< mf(-^{,x‘/=+x) + -TrT| <-x 

n>o{ ^ 4(1-/i)J 

and the first assertion (D.l) follows. The second statement follows from the first one by 
tr(i?2) < ||i?||op tr(il) = Ap. 

Similarly for any fj, > 0 

IPB-f — Tp < — 2 v-v/x) < exp(—/.vv^)lEexp^—^( 7 ''~il 7 — p)^ 


By (D.4) 


logiBexp{-/.( 7 ’^S 7 -p)/ 2 } = ^ - log(l +/.Aj)}. 

1=1 


and 


1 ^ 

-log(l + /.Aj)} 

1=1 


2 2^ 2^ k - 2^ 4: 4 

1=1 k=2 1=1 
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Here the choice n = 2\/x/v yields (D.3). 

One can put together the arguments used for obtaining the lower and the upper bound 
for getting a bound for a general quadratic form 'y~^ , where B is symmetric but not 

necessarily positive. □ 


Finally we apply this result to weighted sums of centered 7 ? . 

Corollary D.4. For any unit vector u = (ui) G and standard normal r.v. ’s 'ji, it 
holds with ||u||oo max* \ui\ 


IP 


n . 

-h 2||ri||ooX j < 2e“''. 
i=l ' 


Proof. The statement follows directly from Theorem D.l. It suffices to notice = 

\\uP = l. □ 


As a special case, we present a bound for the chi-squared distribution corresponding 
to B = Ip. Then tr{B) = p, tr(H^) = p and \{B) = 1. 

Corollary D.5. Let 'j be a standard normal vector in . Then 


lP{\hf >p + 2^ + 2x) < e-^ 

^(IItII >^/p + V^) < e“^, 

lP{\\jf <p-2y/^) <e~^. 


The previous results are mainly stated for a standard Gaussian vector 7 G . Now 
we extend it to the case of a zero mean Gaussian vector ^ with the n x n covariance 
matrix V = {(Jij) with Amax(V) < A* . Given a unit vector u = (ui,... ,Un)~^ G , 
consider the quadratic form 

n 

Q = 

i=l 

We aim at bounding Q — EQ. To apply the result of Theorem D.l represent Q as 
B'y with B depending on u and V. More precisely, let ^ = V ^/^7 for a standard 
Gaussian vector 7 G . Then with U = diag(ui,..., Un) , it holds 

S = = tr(?7V^/S7^V^^2) = tr(H77^) = 7^57 

with B = . Therefore, the bound ||V||op < A* implies 

A = \{B) = ||V^/ 2 j 7 v 1 / 2 ||^^ < ^ 

v^ = tr(H2) = tr(V^/ 2 [/Y[jYi/ 2 ^ < A*tr(l7VI7) < A*^||-uf = A*^ 
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Now the general results of Theorem D.l implies the result similar to Corollary D.4. 

Corollary D.6. For any unit vector u = [m) G IRF, ||ri|| = 1, and normal zero mean 
vector ^ ~ ^(0, V) in with ||V||op < A* , it holds 



IP 


It is worth noting that the identity \\u\\ = 1 implies ||w||oo < 1 • Moreover, in typical 
situations, ||ri||oo , and the leading term in the bounds of Corollaries D.4 and 

D. 6 is 2A* xi/2 ^ 

E Sums of random matrices 

Here we present a number of deviation bounds for a sum of random matrices. 

E. l Matrix Bernstein inequality 

This section collects some useful facts about deviation of stochastic matrices from their 
mean. We mainly use the arguments from the book Tropp (2015). The main step of the 
proof is the following Master bound. 

Theorem E.l (Master bound). Assume that Si,..., Sn are independent Hermitian 
matrices of the same size and Z = Yll=i ■ Then 


®Ai,,^(Z) < inf ^logtr exp 




where Xf^^y-{Z) denotes the algebraically largest eigenvalue of Z. 


For the proof see e.g. Tropp (2015). 


The same result applied to —Z yields the bound for the operator norm \\Z 




(E.l) 
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E.2 Matrix deviation bounds 

The next result provides a deviation bound for a matrix-valued quadratic forms. 

Proposition E.2 (Deviation bound for matrix quadratic forms). Consider a p x n 
matrix U such that 


UU^ = Ip. 

Let the columns cji,..., G IRP of the matrix U satisfy 

< dn (E.2) 

for a fixed constant 6n ■ For a random vector 7 = ( 71 ,... , 7 n)''~ with independent stan¬ 
dard Gaussian components, define 

n 

Z U diag {7 • 7 - l]U^ = . 

i=l 


Then with Xp = x -|- log(2p) 


ll-^llop > 2(i„^/x);-h 2(5„Xp) <e 




Proof. From the Master bound (E.l) 

dP{\\Z\\op > z) < inf e~^^ tr exp | log IEexp{9{'yf — l)u:iCL!] 

\i=l / 

-|- inf e“®^ tr exp Fi exp( 0 (— 7 ^ -|- l)uiLoJ )^ 

Now we use the following general fact: 

Lemma E.3. If x is a random variable and 11 is a projector in IRP, then 

log IE eyLp{xII) = log(iE'e^)lZ'. 


(E.3) 


(E.4) 


(E.5) 


Proof. The result (E.5) can be easily obtained by applying twice the spectral mapping 
theorem. □ 


This result yields, in particular, for any unit vector u G 


log IE exp = log(iE'e^)a;a;''~. 
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Moreover, for any vector cj G ]RP, the normalized product is a rank-one 

projector, and hence. 


logiE'exp(x<^^^) = log(iE'e^ll‘^ll ) 




T 


u; 


With Ui ujiujj /||ajj|p and Xi = Q{li — 1), we derive 

log iE exp { 0 ( 7 ,^ - l)ujiuj} = log IE ex.-p{e{xf - l)\\uif}Ui 

' exp{-\\ujipe) 


= log 


vi-n^iW^o 


u^ 


= \ -\\uji\\^e - -iog(i - 2e\\uji\\^) \Ui 


and 


logiEexp{(9(- 7 ,^ l)uji 0 jj} = logiEexp {Oi-jf + l)||cJi|p) Ui 

< -WuJifeU^ 

< [-\\ui\\‘^9 - ^ log(l - 2e\\ui\\‘^)]Ui. 


Then it follows by (E.4) 


lP{\\Z\\op > z) 


f "" 

< 2 inf tr exp< 

0 >o 1 ^ 

^ 2=1 


r 


20 


1 

2 


log(l - 20||u;i||2) 



(E. 6 ) 


Denote 77 = ( 71 ,, rfn}~^ , where 


Vi = -0 - 


log(l - 2 ||u;^||20) 

2 ||u ;,:||2 


The use of (D.5) and (E.2) yields for 6 < (26^) ^ 

= 911,1 ||2 {^^ll^^ll^ “ “ 20||cJi||2)} 

•^ 11^2 II 

_{29m7^ 

- 4||u.,||2(l-29® - (1-29®- 

Then by (E.6) and L{U~^ = Ip using 7 = 205^ 


IP \\Z 


lop > 2 ) < 2 inf e tr exp{^ diag(r/)Z 2 ^} < 2 inf e tr exp{ ||r 7 ||oo 2 p} 


0 >o 


e>o 


< 2p inf exp< —9z + 
e>o ' 




I- 2961 S /^>o 


= 2p inf exp< —/x 


2x-2 


+ 




261 1 -/X 
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Lemma D.3 helps to bound for Xp = x -|- log(2p) and z = 2J„Xp'^^ -|- 25^Xp that 






< -Xp . 


Therefore, 


P{ ll-^llop > + 25lxp < 2pe =e 


as required. 


□ 


Proposition E.4 (Deviation bound for matrix Gaussian sums). Let vectors ui,..., Un 
in ]RP satisfy 


Vi>i 


< S„ 


for a fixed constant 5n ■ Let 7 ^ be independent standard Gaussian, i = 1,... ,n . Then 
for each vector B = {bi,..., bn)~^ G JR" , the matrix Zi with 


-^1 

i=l 

fulfills 

lp(^\\Zi\\op>5^jB\\V^^ < 2e-\ 

Proof. As 7 j are i.i.d. standard normal and = e“^/^ for |a| < 1 / 2 , it follows from 

the Master inequality and Lemma E.3 


iPiWz 


op 


> ^) < 


2 inf e“^^' 
e>o 


tr exp<^ ^ log IE exp{e^ihiOJiuJ) 


i=l 


< 2 inf e 
e>o 


-ez 


tr exp 


i=i 


Il4 T 

Ui\r UiU 




Moreover, as ||cjj|| < 6n and Ui = uJiuJ /||a;j|p is a rank-one projector with tr 17* = 1, 
it holds 


tr exp 




< exptr 


e'^si 




1=1 



This implies for z = (5^||S||\/^ 

iP(||Zi||op > z) < 2 inf exp^—= 20 “^^ 
and the assertion follows. □ 
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E.3 Matrix valued quadratic forms 

Let ^ = (^ 1 , • ■ ■, ^n)~^ £ -ff?” be a Gaussian zero mean vector with the covariance matrix 
V such that ||V||op = Amax(V) < A* . Let also U he a p x n matrix with columns 
Ui,..., Un G IRP such that 

n 

tiiUhT) = ^ ||a;i|p < q, 

i=i (E.7) 

max < 5n- 

i 

A typical situation we have in mind is when 

n 

= Ip. 

i=l 

Then (E.7) is satisfied with q = p . Moreover, it also holds tr{(L/L/''~)^} = tr{Ul{^) = p . 
Consider the p x p random matrix ®o given by 

n 

T>o = Y.u^^u]{^^-]Eei). (E.8) 

i=l 

In the case of V = /^, Proposition E.2 provides a bound for the operator norm of IBo . 
Below we extend this result to the case of a general matrix V and establish similar bounds 
for quadratic forms of a non-centered vectors ^ + B . Also we evaluate the nuclear and 
Frobenius norms of this matrix. We begin with establishing a bound on the Frobenius 
norm of “Bq . 

Proposition E.5. Let vectors u^i G MP fulfill (E.7). Let also ^ ~ ^(OjV) be a zero 
mean Gaussian vector ||V||op < A* . Then the random matrix Bq from (E.8) satisfies 

IP (tr(®2) > 2A* 52 q (xy2 + X,)) < e-^ (E.9) 

where 5* < 1 and x„ = x + log(n) . 

Proof. Denote rji = Iff — JEff and Cij = ujUj . Then 

n n n n 

^^(®o) ~ ~ CijhiVj • 

2=1 j = l 2=1 j = l 

The n X n matrix C = {cfj) is obviously symmetric positive. Therefore, one can 
represent it in the form C = U]VIU~^ for a diagonal matrix M = diag(pi,..., pn) and 
an orthonormal nxn matrix U = (ui,... ,Un) whose columns are orthonormal 
vectors in IR^ . Therefore, for the vector r] = (r/i,..., , 

n 

trCBg) = rf'Cr] = -q^UMU^r] = Pk\uJv\‘^. 

k=l 


(E.IO) 
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Further, one can bound each ujr] by the result of Corollary D.6: for any > 0 

ip(\ulr]\ > + Il'UfcllooXn)) < 

The choice = x -|- log(n) and (E.IO) imply 

iP^tr(®o) > ^2AVfc(xy^-h ||ttfc||ooX„)'j < e"^. 

k=l ^ 

Also by construction and (E.7) 

n n n / ^ \ 

=tr(C) = ^c2. = ^||a;if = d'^q. 

k=l i=l i=l '^i=l ^ 

The result (E.9) uses a very rough bound ||itfc||oo < for some constant (5* < 1. In 
typical situation one can refine it to ||ufc||oo < C(5„ . □ 


Now we consider a slightly more general situation with a bias component. Given a 
bias vector B in IR^ , a p x n matrix U , and a stochastic Gaussian zero mean vector 
^, define a random p x p matrix 

Si '^=1' ^7diag{(^ + B) . (^ + s) - 4 

The next result bounds the values tr(®i) and tr(‘B^). 

Proposition E.6. Suppose that a Gaussian vector ^ ~ N(0,V) satisfies 

=' ||V-/„||op, (E.ll) 


Let also lAU 


T 


< Ip 


and the vectors Ui - columns of U - satisfy for some q 2 < q 


iiiUlT') = < q, 

i=l 
n 

ii{{UU^f] = '^\ujuj\‘^ < q2, 

*d=i 

max ||ci^i|| < Sn- 

i 

Then on a random set l7io(x) with iP(l7io(x)) > 1 — 2e“^, it holds 

||23i|Lp < \\B\\l + 6l\\B\\V^ 

and on a random set I7ii(x) with iP(l?ii(x)) > 1 — e”’^, it holds 


(E.12) 


(E.13) 
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Further, ||23i||Fr = -\/tr(‘B^) fulfills on a random set i?i 2 (x) with iP(^2i2(x)) 

||®l||Fr < 4\i(x), 

Z^l(x) V||S||^q 2 + 452 ||B||(l + VS). 

Proof. We use the representation 

= Ud[ag{i$ + B) ■ {$ + B)}U'^ 

= U diag{B ■ B}U^ +2U diag{^ ■ B}U'^ . 

'-v-' '-V-" 

25ii ®12 

It obviously holds with ||23i||Fr 

|tr(iBi)| < |tr(iBii)| + |tr(®i 2 )| , 
ll'BlIlPr < ||iBii||Fr + ||‘Bl2||Fr • 

We proceed with each for m = 1,2 separately starting from ®ii. 
Bounds for 23n : The bias term 23n can be estimated as follows: 

II Mop 

tr(Tii) 

Another way of bounding the value tr(23f]^) is based on the condition ||cuj|| 5 
for any unit vector 7 G IBP , it holds | 7 ''^cJi| < 5n and 

n n 

7'^23n7 = ^ 5l^hj = 5l\\Bf. 

i=l i=l 

Therefore, ||23ii||op < (5^||-B|p and hence, 

tr(232,) < [5l\\Bffp = 6t,\\Btp. 

Note, however, that the bound (E.16) is typically more accurate: the value <5^ 
q/n and ||-B|p x n||B||^ , so that » q 2 ||-B||^ for p large. 


<\\Bf\\UU^\\ < ||S||2 , 

— II 1100 11 11 op — II 1100 ’ 


bi 

i=l 

< \\B\\l^ir{UU'^) <q||B| 

/ n .2 

< 92 l|S| 


|2 

loo ’ 


2 = 1 


4 

00 • 


> 1 - e"^ 


(E.15) 


(E.16) 


5n ■ Then 


is of order 
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Bounds for 23 12 : Proposition E.4 provides a bound for 23 12 in the operator norm for 
standard Gaussian ^: 

2p(pi2||op < 2e-^ 

Now we bound tr(23i2 ) and tr(23^2) • By definition, 

n \ n 

= 2 ^ Wuif^ibi = 2u^$, 
i=l ^ i=l 

where u is the vector in iR" with the entries Ui = As Var(^) = V with 

||V||op < 2 , ^ is a Gaussian zero mean random variable whose variance satisfies 

n 

< 2\\uf = 2 ^ < 25^||Sf. 

i=l 

Here we have used (E.ll) and ||cjj|| < 6n- Therefore, on a random set f 7 i 2 (x) with 
iP(f 2 i 2 (x)) > 1-e-", 

|tr(Bi 2 )| < 2 V 25 I ||B|| zi(x) < 4 x^/ 2^2 ||_g||^ 

where ^;i(x) < 1 /^ is given by 2 P(| 7 | > ^i(x)) = e"’^ for a standard normal 7 . 

It remains to bound tr(23f2) • Because of cross-dependence of the ’s, we cannot 
directly apply the result of Proposition E.4. Instead we use the following representation: 

n 

tr(T? 2 ) = . 

*j=i 

Denote by Ci the n x n matrix with the entries (ujujj)‘^bibj for i,j = 1,... ,n. The 
use of ^ = V ^/^7 with V = Var(^) and a standard normal 7 G yields 


tr(23i2) = 2tr 


tr(®?2) = = 47^V^/2c'^y1/2^ ^ 47^C27, 


where C 2 = . Now the bound of Proposition D.l on Gaussian quadratic 

forms is well applicable. It holds 

n 

p(C 2 ) = ir{C 2 ) < 2tr(Ci) = 2 ^ ||a;,||\2 < 26t\\Bf. 

i=l 

Similarly for any unit vector u G 2R” , it holds by \(jjJU j\ <5l 

n / n \ 2 

UiUjbibj{u:J< 5^('^^UibA < 5^ ||u||^ ||B||^ = 5^ ||B||^ 
i,j=l ^ 
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yielding Amax(C'i) <df^\\B\\‘^ and 

A(C2) =' A„,ax(C'2) < \\Bf. 

Proposition D.l implies on a random set of probability at least 1 — e“^ 

= 2^7^ <^27 

< 2Vp(C2) + 2V2A(C2)x 
= 2V25l\\B\\ (l + \/^) 

< A5l\\B\\{l + V^). 

Putting all bounds together yields by (E.15) the statements of the proposition. □ 

Now we evaluate the effect of presmoothing in the stochastic component. Let ^ be 
normal zero mean vector in ]RP with a covariance matrix V satisfying (E.ll). The goal 
is to bound the values tr('B 2 ) and tr(lB 2 ) for the difference 

B2 =^L/diag{^-4-iE(^-0W^- 

Proposition E.7. Suppose that ^ rsj J^{0,Y) with V satisfying (E.ll). Let also lAlA^ < 
Ip and the vectors Ui - columns of U - satisfy (E.12) for some q 2 < q- Then on a 
random set 1222 (x) with iP(l722(x)) > 1 — e~^ , it holds for ||T 2 ||Fr = 

p2||Fr < ^2(x) 2 v^x^qd^ . 

Moreover, on a random set I72i(x) with JP(l72i(x)) > 1 — 2e“^, it holds 

|tr(iB2)| < 4\/xq(52 +4 x,5^. (E.17) 

Proof. By (E.ll), the covariance matrix V = Var(^) fulfills 

llVilop < l + 5i<2. 

Now, by Proposition E.5, for x,i = x + log(n) and (5* < 1, it holds on a random set 
‘^22(^) with iP(l722(^)) — 1 — 

tT{T,l) <2{l + 5,)6lq{xi/^ + 6*Xn). 

Here <5* < 1, usually <5* <C 1, and (5i < 1, so we simplify the bound to 


tr(®2) < 4x„q(5^. 
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Now we bound the trace tr(232). By definition, it holds for columns cjj G of U 

n n 

tr(lB2) = - lE^f) = Y, - JEih, 

i=\ i=l 

and by Corollary D.6, it holds on a random set I72i(x) with iP(l72i(x)) > 1 — 2e“^ 

/ n X 1/2 

|ti'(232)| < 4x^/^ ( llcjjll^ j -|-4xmax llcjjll^. 


This implies in view of ||cjj|| < 5n and (E.12) 

n n 

Y ^ max||aJi|p^ ||cjj||^ < qSl- 

i=l i=l 

Therefore, it holds on 1721 (x) 


|tr(‘B2)| < 4v^xqJ2 +4x(52 


as required. □ 

Finally we bound the deterministic term 

®3 =Udmg{]E{^ ■ (E.18) 

Proposition E.8. Suppose that ^ ~ !N(0,V) with 


6 s = max — 1|. 
i 

Let also UlA^ < Ip and the veetors uji - columns of U - satisfy (E.12) for some q2 < q. 
Then it holds for the matrix IB3 from (E.18) 


|tr(233)| < 4q, (E.19) 

tr(a3i) <Zi3(x)"='52q2. 


Proof. By direct calculus, it holds in view of (E.12) and \lEff — 1| < 6^ 

/ n X 2 

tr(®|) = iv[Y^i^l - 1) j < 6 l ti{{UU^f} < 61 ^ 2 ] 
^i=l 

Further, 

|tr('B3)| < ill Y^i^l ~ l|) ^ 6 sir{UVr) < (5^ q. 

This yields the assertion. 


□ 
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Now we present a bound in the operator norm of the matrix “B with 

B =l^dmg{(^ + B)-(^ + B)- InjU^ . (E.20) 

Proposition E.9. Assume the conditions of Proposition E.6, and let the rows of 
^ <M ^-ii 2 jj^il 2 gdfigflgg || 2 ^|| < _ Then on a random set f7op(x) with iP(i7op(x)) > 

1 — 6 e“^ , it holds with x„ = x + log(n) and Xp = x + 2 log(p) for B from (E.20) 

||®||op ^ 2 \op(x), 

Z\op(x) ||B||^ + dl\\B\\V^ + + 2dn^n + 

Proof. We use the following decomposition: 

23 =^^diag{(^ + S).(^ + B)-4.^}^^T 1l^23i 

+ B diag{^ • ^ - 7 • 7 }^"^ 

+ U diag {7 • 7 - In}B'^ B^ 

Here 7 is a standard Gaussian vector. Obviously 

||23||op < ||23iHop + ||234||op + ||235Hop . 

The value H^illop is already evaluated in (E.13) of Proposition E. 6 : 

H^lHop < H-®llL + 

on a random set l7io(x) with IP (^Qiq (x)) > 1 — 2e *. The matrix B^ can be bounded 
by a version of matrix Bernstein inequality (E.3) in Proposition E.2: one a set f 25 (x) 
with iP(l 75 (x)) > 1 — 2 e“^ 

H'BsHop < 25n^ + 26lyip. 

It remains to bound the value H234Hop • By definition, with T = 

This obviously implies 

= (E 7 ) • (E 7 ) - 2 (r 7 ) • 7 


®4Hop < ||Z^diag{(T7) • (r7)}W’^||^p + 2||Wdiag{(r7) • 7}^"^ 


and 
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The condition UU < Ip helps to bound 

||^Ydiag{(r7) • < ||r7||L < \\r-r\\l. 

Similarly 

\\Ud\a.g{{T-f) < ||r7||oo II7II00 WUU^Wop < ||T’7||c 

It is well known that the sup-norm of a standard Gaussian vector 7 can be bounded as 

00 ^ 


00 / 00 


with = X + log(n) on a set of probability 1 — e“^. Further, if each row Tj of T 
satisfies ||T^|| < 5^, then the scalar product TJ^ is a normal zero mean r.v. with the 
variance 


Var(77) = \\n?<5^ 


and 


]P{\Tj'i\ > 5nZi{yi)) < e ^ 

with -Zi(x) < \/^ yielding 

^(II^tIIoo > < e“^. 

Summing together results in the bound 

1 1 ^ 4 ! I op ^ 2(5nXn + 

on a set i74(x) with ^(l74(x)) > 1 — 2e“^ . 


□ 


F Gaussian comparison via KL-divergence and Pinsker’s 
inequality 

Suppose that two p-dimensional zero mean Gaussian vectors ^ ~ N(0,5) and ~ 
>^■(0,5^) are given. Let also T map IRP to IR^ and X = r(^) and Y = . We 

aim to bound the distance between distributions of X and Y under the conditions 

115 - 1 / 25 ^ 5 - 1/2 _ < e< 1 / 2 , 

tr(5-i/25^5-i/2 - Ipf < (F.l) 

for some e < 1/2 and Z\ > 0 . The next lemma bounds from above the Kullback-Leibler 
divergence between two normal distributions. 
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Lemma F.l. Let Pq = N(6, S) and Pi = 5^) for some non-degenerated matriees 

S and 5^. If 

115 - 1 / 25 ^ 5 - 1 / 2 -/pilop < 1/2, 
tr|(5-i/25^5-i/2-/p)^| < Z\2, 

then 

%{Po,Pi) = -/^^olog^ <^ + ]^{b-b^)^S\b-b^). 

For any measurable set A C , it holds 

\Po{A) - Pi{A)\ < ^%{Po,Pi)/2. 

Proof. The change of variables u = S~^/‘^{x — b) reduces the general case to the situation 
when iPo is standard normal in P^ while Pi = with (3 = 5i/2(b^ — b) and 

B 5 - 1 / 25 ^ 5 - 1/2 

2 log ^( 7 ) = logdet(P) - (7 - / 3 )’^P (7 - /3) + ||7f 
with 7 standard normal and 

2%{Po, Pi) = -2iEo log ^ = - logdet(P) + tr(P - 4 ) + (iTB(3. 

dPo 

Let aj be the j th eigenvalue of B — Ip. The condition ||P — /p||op < 1/2 yields 
\aj\ < 1/2 and 

p 

2%{Po,Pi) = (3^B(3 + - log(l + aj)} 

i=i 

p 

< (3'^B(3 + Y,a] 

i=i 

< p^Bp + tr(P - /p)2 < p^Bp + Z\2. 

This implies by Pinsker’s inequality 

sup|iPo(^) - Fi{A)\ < y^3C(iPo,iPi) < ^^A^+l3'^Bf3 

as required. □ 


Notice that the operator norm bound 
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implies for B = S 

tr(B -/p)^ < pe^, p'^Bp<{l + €)\\(3f. 

Corollary F.2. Let Pq = }Nf( 6 ,5) and Pi = 'N{b'^,S^) for some non-degenerated 
matrices S and satisfying (F.2). Then 

sup\Po{A) - Pi{A)\ < \\/pe‘^ + (1 -h€)||/3||2 

For the special case with /3 = 0, we bound for any Borel set A C P^ 

\P{T{i) €A)- P{T{$^) G a) I < A/2. 

We state a separate corollary for the distribution of the maximum. 

Corollary F.3. Let two p-dimensional zero mean Gaussian vectors ^ ~ [N'(0,5) and 
~ [N’(0,5^) be given, and (F.l) holds. Then for any mapping T : P^ -A Pf* and any 
set of values {qrf), the random vectors X = T{^) and Y = T{^^) fulfill 

|iP(maxX^ “ <?r7 > O) — PimaxYri — (i'»7 > 0)| < A/2. 

Proof. We simply apply the result of the lemma to the set A = {x G P^: T[x) < z} . □ 
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